library(dplyr)  # filter(), select(), relocate()
library(tidyr)  # drop_na()

tornado_data <- read.csv("./data/1950-2023_actual_tornadoes.csv", na.strings = c("NA", "N/A", " "))
cleaned <- tornado_data |>
  filter(yr >= 2000) |>
  select(-tz, -stf, -ns, -sn, -sg, -f1, -f2, -f3, -f4) |>
  relocate(fc, .after = mag) |>
  drop_na()
write.csv(cleaned, "cleaned_df.csv", row.names = FALSE)
# Pruning
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree_model, cp = best_cp)
rpart.plot(pruned_tree, type = 3, extra = 101, fallen.leaves = TRUE)
The pruned decision tree identifies magnitude (mag) as
the primary factor influencing tornado-related injuries. Tornadoes with
mag < 4 predict few injuries, averaging about 0.2 injuries
per event, indicating that low-magnitude tornadoes pose little
risk. For tornadoes with mag >= 4, injuries increase
substantially, with additional splits showing that long paths
(len >= 75 miles) and high property loss
(loss >= 23) are associated with severe injury outcomes.
This hierarchy underscores the dominant role of magnitude, with path
length and property loss acting as secondary risk factors in extreme
scenarios.
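The cp-selection rule used for the pruning above (pick the complexity parameter with the smallest cross-validated error in the cptable, then prune back to it) can be sketched on a built-in dataset; mtcars stands in for the tornado data purely for illustration:

```r
library(rpart)

set.seed(123)
# Deliberately overgrown tree (tiny cp, small minsplit)
fit <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(cp = 0.001, minsplit = 5))

# Pick the cp with the lowest cross-validated error, then prune back to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Pruning can only remove splits, never add them
nrow(pruned$frame) <= nrow(fit$frame)
```

Because the cross-validated error (xerror) is itself random, `set.seed()` before fitting keeps the selected cp reproducible.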
# Cross-Validation
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(
inj ~ mag + wid + len + loss,
data = cleaned,
method = "rpart",
trControl = train_control,
tuneLength = 10
)
rpart.plot(cv_model$finalModel, type = 3, extra = 101, fallen.leaves = TRUE)
The cross-validated tree confirms that magnitude (mag)
is the most critical factor, with injuries significantly increasing for
mag >= 4. This model captures more nuanced relationships
among variables, such as the interplay between path length
(len) and property loss (loss). Long paths
(len >= 75) consistently predict the highest injury
counts, while shorter paths refine the predictions based on magnitude
and loss. This suggests that while cross-validation slightly improves
predictive accuracy, the variable hierarchy remains consistent with the
pruned tree.
tree_model$variable.importance
## mag len loss wid
## 947829.455 605831.823 63904.583 9161.009
cv_model$finalModel$variable.importance
## mag len loss wid
## 947829.455 605831.823 63904.583 9161.009
Variable importance rankings in both models reinforce the dominance
of magnitude (mag), which has a significantly higher score
than the other variables. Path length (len) is the second
most influential factor, followed by property loss (loss).
Tornado width (wid) has a minor role, suggesting that its
contribution to injury predictions is less direct. These rankings
highlight the importance of focusing on magnitude and path length when
assessing tornado injury risks.
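Raw importance scores like those above are on an arbitrary scale; they are easier to compare when rescaled to percentages. A minimal sketch (again on mtcars, not the tornado data):

```r
library(rpart)

fit <- rpart(mpg ~ wt + hp + disp, data = mtcars)
imp <- fit$variable.importance

# Rescale so the scores sum to 100 and read as shares of total importance
round(100 * imp / sum(imp), 1)
```

The ranking is unchanged by the rescaling; only the interpretation ("mag accounts for X% of total importance") becomes more direct.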
The decision tree visualizations clearly illustrate key thresholds in
the data. For mag >= 4, injuries escalate, particularly
when combined with long paths (len >= 75) or high
property loss (loss >= 23). Tornadoes with
mag < 4 result in minimal injuries regardless of other
variables. These thresholds provide actionable insights for disaster
preparedness, emphasizing the need to monitor and respond aggressively
to high-magnitude and long-path tornadoes.
library(pROC)
library(caret)
set.seed(123)
cleaned$long_path_binary <- factor(ifelse(cleaned$len > quantile(cleaned$len, 0.75), 1, 0),
levels = c(0, 1), labels = c("No", "Yes"))
# Set up Cross-Validation
cv_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Train Logistic Regression Model with Cross-Validation
cv_logistic_model <- train(
long_path_binary ~ slat + slon + mag * wid,
data = cleaned,
method = "glm",
family = binomial(),
metric = "ROC",
trControl = cv_control
)
cv_logistic_model
## Generalized Linear Model
##
## 29507 samples
## 4 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 26556, 26557, 26556, 26557, 26556, 26556, ...
## Resampling results:
##
## ROC Sens Spec
## 0.835001 0.9543641 0.3930859
The cross-validation step evaluates the model’s ability to predict long-path tornadoes. The 10-fold cross-validation process repeatedly splits the data into training and testing subsets to ensure that the model generalizes to unseen data. The ROC value of 0.835 indicates good discriminative ability between long-path and short-path tornadoes. Note that caret treats the first factor level ("No") as the event of interest, so the high sensitivity (0.954) means short-path tornadoes are almost always identified correctly, while the low specificity (0.393) shows that many long-path tornadoes are missed, likely a consequence of the class imbalance created by the 75th-percentile split.
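Both rates come straight from a confusion matrix. A sketch with made-up counts (not the actual fold results), following caret's convention that the first factor level "No" is the event of interest:

```r
# Hypothetical counts, chosen only to mirror the reported rates
no_correct  <- 95   # "No" cases predicted "No"
no_missed   <- 5    # "No" cases predicted "Yes"
yes_correct <- 39   # "Yes" cases predicted "Yes"
yes_missed  <- 61   # "Yes" cases predicted "No"

sens <- no_correct / (no_correct + no_missed)    # sensitivity for the "No" class
spec <- yes_correct / (yes_correct + yes_missed) # specificity = "Yes" detection rate
c(sensitivity = sens, specificity = spec)        # 0.95, 0.39
```

The arithmetic makes the imbalance visible: a model can score high sensitivity simply because the majority class dominates, while the minority class drives specificity.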
cv_results <- cv_logistic_model$results
cv_results
## parameter ROC Sens Spec ROCSD SensSD SpecSD
## 1 none 0.835001 0.9543641 0.3930859 0.006624361 0.004754727 0.01020199
roc_curve <- roc(
cleaned$long_path_binary,
predict(cv_logistic_model, newdata = cleaned, type = "prob")[, "Yes"]
)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve with Cross-Validation", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "gray")
The ROC curve visually confirms the model’s performance, with the curve substantially above the diagonal line (representing random chance). The area under the curve (AUC = 0.835) further reinforces the logistic regression’s ability to classify tornado paths. While overall discrimination is strong, the low specificity should be revisited, perhaps by balancing the classes or exploring additional feature interactions.
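The AUC computation itself can be sketched with pROC on synthetic scores, independent of the tornado data; here the "Yes" class is simulated to score higher on average:

```r
library(pROC)

set.seed(1)
truth <- factor(rep(c("No", "Yes"), each = 200), levels = c("No", "Yes"))
score <- c(rnorm(200, mean = 0), rnorm(200, mean = 1.5))  # "Yes" scores higher

# levels = c(control, case); direction "<" means controls score lower than cases
r <- roc(truth, score, levels = c("No", "Yes"), direction = "<")
auc(r)  # well above 0.5 because the classes separate
```

Setting `levels` and `direction` explicitly, as here, silences the "Setting levels / Setting direction" messages seen in the output above and guards against pROC guessing the orientation wrong.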
library(car)
vif(cv_logistic_model$finalModel)
## slat slon mag wid `mag:wid`
## 1.045057 1.055685 2.108353 2.683890 4.213116
summary(cv_logistic_model$finalModel)
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9785554 0.2067003 -9.572 < 2e-16 ***
## slat 0.0120309 0.0033344 3.608 0.000308 ***
## slon 0.0085452 0.0020045 4.263 2.02e-05 ***
## mag 0.7974274 0.0306795 25.992 < 2e-16 ***
## wid 0.0054422 0.0001725 31.556 < 2e-16 ***
## `mag:wid` -0.0006512 0.0001119 -5.820 5.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33182 on 29506 degrees of freedom
## Residual deviance: 25463 on 29501 degrees of freedom
## AIC: 25475
##
## Number of Fisher Scoring iterations: 6
The model summary highlights significant predictors:
Magnitude (mag): Tornado magnitude is the strongest predictor, with higher magnitudes increasing the likelihood of long paths.
Width (wid): Wider tornadoes significantly contribute to predicting long-path events, consistent with their destructive nature.
Latitude (slat): Tornadoes at higher latitudes tend to have longer paths.
Longitude (slon): Tornadoes further east also tend to have longer paths.
Interaction term (mag:wid): The negative coefficient indicates that the added effect of width diminishes at higher magnitudes, so the two predictors partially overlap rather than simply compounding.
The variance inflation factor (VIF) values are all below the commonly accepted threshold of 5, suggesting no severe multicollinearity among predictors. This supports model stability and interpretability.
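The threshold-of-5 rule rests on VIF's definition, 1/(1 - R²) from regressing each predictor on the others. A quick check on mtcars (illustrative only, not the tornado model):

```r
library(car)

# VIF for wt by definition: regress it on the other predictors
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# car::vif() returns the same value for a plain linear model
all.equal(vif_wt, unname(vif(lm(mpg ~ wt + hp + disp, data = mtcars))["wt"]))
```

Reading it this way, a VIF of 5 corresponds to R² = 0.8: the other predictors explain 80% of that predictor's variance.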
library(MASS)
library(ggplot2)
tornado_widths <- cleaned$wid[!is.na(cleaned$wid)]
exp_fit <- fitdistr(tornado_widths, "exponential")
cat("MLE Estimated Rate (Lambda):", exp_fit$estimate, "\n")
## MLE Estimated Rate (Lambda): 0.007025694
This analysis examines whether tornado widths follow an exponential distribution by using Maximum Likelihood Estimation (MLE) to estimate the rate parameter (λ). The estimated rate parameter (λ) is 0.0070, indicating the rapid decline in the density of larger tornado widths. The exponential distribution assumes that events (in this case, tornado widths) decrease in frequency exponentially as their magnitude increases. This aligns with the expectation that narrower tornadoes are much more frequent than wider ones.
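For the exponential distribution the MLE has a closed form, lambda-hat = 1/mean(x), which is exactly what fitdistr returns; a sketch on simulated widths (synthetic data, not the tornado records):

```r
library(MASS)

set.seed(42)
x <- rexp(10000, rate = 0.007)  # simulated "widths" at roughly the estimated rate

fit <- fitdistr(x, "exponential")
c(mle = unname(fit$estimate), closed_form = 1 / mean(x))  # these agree
```

This also explains the interpretation in the text: the estimated rate is just the reciprocal of the mean width, so a lambda of 0.0070 corresponds to a mean width of about 142 yards in the data's units.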
To investigate whether there are significant differences in fatalities and economic losses across seasons (Winter, Spring, Summer, Fall), a non-parametric method was used to compare the medians of loss and fatalities across the seasons.
To evaluate the seasonal differences in tornado impacts,
Kruskal-Wallis tests were conducted for both loss and fatalities across
seasons. The null hypothesis for each test is that the median values of
loss and fatalities are the same across all seasons, while the
alternative hypothesis states that at least one season has a
significantly different median. Using the cleaned dataset, which
includes the categorical variable season and the continuous
variables loss and fat (fatalities), the test
statistics and p-values were calculated to assess the significance of
these differences at a 0.05 significance level. This approach allows for
a robust comparison of median values across seasons, even with
non-normally distributed data.
cleaned <- cleaned |>
  mutate(
    # Create season as a factor based on the month
    season = factor(case_when(
      mo %in% c(12, 1, 2) ~ "Winter",
      mo %in% c(3, 4, 5) ~ "Spring",
      mo %in% c(6, 7, 8) ~ "Summer",
      mo %in% c(9, 10, 11) ~ "Fall"
    ), levels = c("Winter", "Spring", "Summer", "Fall")),
    # Convert mag to a factor
    mag = as.factor(mag)
  )
# Kruskal-Wallis test for loss across seasons
kruskal_loss <- kruskal.test(loss ~ season, data = cleaned)
print(kruskal_loss)
##
## Kruskal-Wallis rank sum test
##
## data: loss by season
## Kruskal-Wallis chi-squared = 1162.5, df = 3, p-value < 2.2e-16
# Kruskal-Wallis test for fatalities across seasons
kruskal_fatalities <- kruskal.test(fat ~ season, data = cleaned)
print(kruskal_fatalities)
##
## Kruskal-Wallis rank sum test
##
## data: fat by season
## Kruskal-Wallis chi-squared = 146.92, df = 3, p-value < 2.2e-16
The test result shows significant seasonal variations in both economic losses and fatalities. For economic losses, the p-value was less than 0.05, indicating strong evidence to reject the null hypothesis that losses are evenly distributed across all seasons. Similarly, the p-value for fatalities was less than 0.05, also providing strong evidence to reject the null hypothesis and conclude that fatalities vary significantly across seasons.
To identify which specific seasons differ, post-hoc tests were then conducted using Dunn’s test with Bonferroni adjustment (dunnTest() from the FSA package).
library(FSA)  # dunnTest()
# Post-hoc test for loss across seasons
dunn_loss <- dunnTest(loss ~ season, data = cleaned, method = "bonferroni")
print(dunn_loss)
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Bonferroni method.
## Comparison Z P.unadj P.adj
## 1 Fall - Spring 6.664038 2.664053e-11 1.598432e-10
## 2 Fall - Summer 20.528670 1.193991e-93 7.163947e-93
## 3 Spring - Summer 19.318850 3.728658e-83 2.237195e-82
## 4 Fall - Winter -12.277059 1.202927e-34 7.217560e-34
## 5 Spring - Winter -20.128608 4.144714e-90 2.486828e-89
## 6 Summer - Winter -31.691891 2.009520e-220 1.205712e-219
In the post-hoc Dunn’s test for losses, all pairwise comparisons between seasons show significant differences, with adjusted p-values far below 0.05. The signs of the Z statistics indicate that Winter has the highest median losses of any season, and that Fall has higher losses than Spring and Summer. Given these significant seasonal trends, it is important to focus on Fall and Winter tornadoes when addressing economic impacts, even though tornadoes are not most frequent in those seasons.
# Post-hoc test for fatalities across seasons
dunn_fatalities <- dunnTest(fat ~ season, data = cleaned, method = "bonferroni")
print(dunn_fatalities)
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Bonferroni method.
## Comparison Z P.unadj P.adj
## 1 Fall - Spring -0.5929891 5.531884e-01 1.000000e+00
## 2 Fall - Summer 5.3279154 9.934636e-08 5.960781e-07
## 3 Spring - Summer 7.9963062 1.282073e-15 7.692437e-15
## 4 Fall - Winter -6.0488587 1.458755e-09 8.752529e-09
## 5 Spring - Winter -6.5895148 4.412662e-11 2.647597e-10
## 6 Summer - Winter -11.4469857 2.434651e-30 1.460791e-29
For the pairwise comparisons of fatalities between seasons, the adjusted p-values show statistically significant differences for all pairs except Fall and Spring, indicating approximately similar fatality distributions for those two seasons. The signs of the remaining Z statistics indicate that Winter has relatively higher fatalities than the other seasons, while Summer ranks lowest. Therefore, preventive measures aimed at fatality risk should focus particularly on the Winter season.
To examine whether there is an association between the variables
mag (magnitude) and fat (fatalities) in the dataset, a Chi-Square test
of independence was performed. The null hypothesis (H0) states that
tornado magnitude and fatalities are independent; the alternative
hypothesis (H1) states that there is an association between tornado
magnitude and the number of fatalities.
table_data <- table(cleaned$mag, cleaned$fat)
# Perform Chi-Square Test
chisq.test(table_data)
##
## Pearson's Chi-squared test
##
## data: table_data
## X-squared = 25983, df = 162, p-value < 2.2e-16
The result shows that the p-value is far below the standard significance level (e.g., 0.05), so we reject the null hypothesis, indicating that there is a statistically significant association between the magnitude of the event and the number of fatalities. This result statistically confirms the observations initially identified during the exploratory data analysis (EDA), where trends suggested that higher magnitudes are likely to result in more fatalities.
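The statistic underlying that p-value is the sum of (observed - expected)² / expected over the contingency cells. A worked check on a small hypothetical table (not the tornado counts):

```r
# Small hypothetical 2x2 contingency table
tab <- matrix(c(30, 10, 20, 40), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
stat <- sum((tab - expected)^2 / expected)

# Matches chisq.test() when the continuity correction is turned off
all.equal(stat, unname(chisq.test(tab, correct = FALSE)$statistic))
```

One caveat worth checking on the real table: with df = 162 many cells of the mag-by-fat table are sparse, and the chi-square approximation is less reliable when expected counts are small.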
A linear regression model was then developed to analyze the
relationship between fatalities (fat) as the dependent variable and
multiple predictors: magnitude (mag), injuries (inj), loss (loss),
length (len), and width (wid). Diagnostics were conducted to
evaluate the assumptions of linear regression, including linearity,
normality, homogeneity of variance, and outliers. The
check_model() function from the performance package was used to
evaluate four assumptions: linearity, normality (QQ plot),
homogeneity of variance (homoscedasticity), and influential
observations.
library(performance)  # check_model()

# Prepare the model specification (parsnip; retained although the final fit uses lm() directly)
lm_spec <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

final_model <- lm(fat ~ mag + inj + loss + len + wid, data = cleaned)
check_model(final_model, check = c("linearity", "qq", "homogeneity", "outliers"))
Based on the diagnostic observations from the linear regression model, a more flexible modeling approach was implemented using polynomial terms and interaction effects to better capture the nonlinear relationships and reduce the influence of heteroscedasticity.
Model 1 was selected with the following predictors, some of which were tested and observed in the previous analysis:
fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2),
i.e., second-order polynomial terms for all predictors (mag, inj, loss, len, and wid) without interaction terms.
# Fit the regression model
model=lm(fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2), data = cleaned)
# Tidy the model results and print in a neat table
broom::tidy(model) |>
knitr::kable(digits = 3, caption = "Regression Model Results with Polynomial Terms")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.060 | 0.004 | 13.334 | 0.000 |
| poly(mag, 2)1 | 8.860 | 1.052 | 8.422 | 0.000 |
| poly(mag, 2)2 | 15.400 | 0.855 | 18.015 | 0.000 |
| poly(inj, 2)1 | 161.696 | 0.858 | 188.433 | 0.000 |
| poly(inj, 2)2 | -2.946 | 0.858 | -3.435 | 0.001 |
| poly(loss, 2)1 | -11.555 | 0.780 | -14.815 | 0.000 |
| poly(loss, 2)2 | 3.483 | 0.775 | 4.495 | 0.000 |
| poly(len, 2)1 | 8.696 | 1.028 | 8.462 | 0.000 |
| poly(len, 2)2 | 27.108 | 0.857 | 31.628 | 0.000 |
| poly(wid, 2)1 | 1.812 | 1.029 | 1.760 | 0.078 |
| poly(wid, 2)2 | -1.252 | 0.827 | -1.515 | 0.130 |
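One note on reading that table: poly() fits orthogonal polynomials, so the reported coefficients are not the coefficients of the raw x and x² terms, although the fitted curve is identical. A quick check on synthetic data:

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(100, sd = 0.1)

fit_orth <- lm(y ~ poly(x, 2))   # orthogonal basis (as in the table above)
fit_raw  <- lm(y ~ x + I(x^2))   # raw powers

# Different coefficients, identical fitted values
max(abs(fitted(fit_orth) - fitted(fit_raw))) < 1e-8
```

The orthogonal basis keeps the polynomial terms uncorrelated with each other, which stabilizes the standard errors and p-values shown in the table.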
A comparison model was used to test whether simplifying the model (fewer polynomial terms plus an interaction) improves generalization and prediction performance. Model 2:
fat ~ poly(mag, 2) + inj:loss + poly(len, 2) + poly(wid, 2),
with second-order polynomial terms for mag, len, and wid, and an interaction term between inj and loss. The dataset was repeatedly split into training and testing sets for cross-validation, with each model trained on the training set and evaluated on the corresponding test set. Prediction accuracy was assessed using the root mean squared error (RMSE), and a violin plot is used to visualize the distribution of cross-validated RMSE values.
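RMSE here is modelr's rmse(), the root of the mean squared prediction error; its definition can be verified on a built-in dataset:

```r
library(modelr)

fit <- lm(mpg ~ wt, data = mtcars)

# RMSE by definition: root of the mean squared prediction error
rmse_manual <- sqrt(mean((mtcars$mpg - predict(fit, mtcars))^2))

all.equal(rmse_manual, rmse(fit, mtcars))
```

Because RMSE is in the units of the response (fatalities per tornado here), the cross-validated means below can be read directly as typical prediction errors.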
library(modelr)  # crossv_mc(), rmse()
library(purrr)   # map(), map2_dbl()

cv_df <- crossv_mc(cleaned, 100) |>
  mutate(
    # Convert train and test sets to tibbles
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  )

# Fit models and calculate metrics
cv_results <- cv_df |>
  mutate(
    # Fit models with polynomial terms
    model1_mod = map(train, \(df) lm(fat ~ poly(mag, 2) + poly(inj, 2) + poly(loss, 2) + poly(len, 2) + poly(wid, 2), data = df)),
    model2_mod = map(train, \(df) lm(fat ~ poly(mag, 2) + inj:loss + poly(len, 2) + poly(wid, 2), data = df))
  ) |>
  mutate(
    # Calculate RMSE for each model on the test sets
    rmse_model1 = map2_dbl(model1_mod, test, \(mod, df) rmse(model = mod, data = df)),
    rmse_model2 = map2_dbl(model2_mod, test, \(mod, df) rmse(model = mod, data = df))
  )
# Combine metrics for all models
summary_results <- cv_results |>
  summarise(
    mean_rmse_model1 = mean(rmse_model1),
    mean_rmse_model2 = mean(rmse_model2)
  )
print(summary_results)
## # A tibble: 1 × 2
## mean_rmse_model1 mean_rmse_model2
## <dbl> <dbl>
## 1 0.718 1.14
cv_long <- cv_results |>
  select(starts_with("rmse")) |>
  pivot_longer(
    everything(),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) |>
  mutate(model = fct_inorder(model))

cv_long |>
  ggplot(aes(x = model, y = rmse)) +
  geom_violin() +
  labs(
    title = "Cross-Validated RMSE for fatality",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal()
Cross-validation shows that the first model (model1_mod)
clearly outperforms the second (model2_mod): the mean RMSE
rises from about 0.72 for Model 1 to about 1.14 for Model 2. The
simplification in Model 2 therefore costs predictive accuracy rather
than improving generalization.